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Abstract. Recently, ultra high-throughput sequencing of RNA (RNA- 
Seq) has been developed as an approach for analysis of gene expression. 
By obtaining tens or even hundreds of millions of reads of transcribed 
sequences, an RNA-Seq experiment can offer a comprehensive survey 
of the population of genes (transcripts) in any sample of interest. This 
paper introduces a statistical model for estimating isoform abundance 
from RNA-Seq data and is flexible enough to accommodate both single 
end and paired end RNA-Seq data and sampling bias along the length 
of the transcript. Based on the derivation of minimal sufficient statis- 
tics for the model, a computationally feasible implementation of the 
maximum likelihood estimator of the model is provided. Further, it is 
shown that using paired end RNA-Seq provides more accurate isoform 
abundance estimates than single end sequencing at fixed sequencing 
depth. Simulation studies are also given. 

Key words and phrases: Paired end RNA-Seq data analysis, minimal 
sufficiency, isoform abundance estimation. Fisher information. 



cn 

O 



1. INTRODUCTION 

1.1 Biological Background 

All cells in an individual mammal have almost 
identical DNA. Yet, cell function within an organ- 
ism has huge variation. One mechanism that differ- 
entiates cell function is its gene expression pattern. 
Recent research has shown that this differentiation 
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may be on a fine scale: that subtle sequence vari- 
ants of expressed genes (also referred to as tran- 
scripts), called isoforms, have significant impact on 
the function of the proteins encoded by the RNA 
and hence their function in the cell (see, e.g., Wang 
et al. (2008)). The purpose of this paper is to develop 
and analyze statistical methodology for measuring 
differential expression of isoforms using an emerging 
powerful technology called Ultra High Throughput 
Sequencing (UHTS). Such study has the potential 
to help reveal new insights into cellular isoform level 
gene expression patterns and mechanisms, including 
characteristics of cell specific specialization. 

The central dogma in biology describes the infor- 
mation transfer that allows cells to generate pro- 
teins, the building blocks of biological function. This 
dogma states that DNA is transcribed to messenger 
RNA (mRNA) which is in turn translated into pro- 
teins. Recent discoveries have highlighted the im- 
portance of regulation at the level of mRNA, show- 
ing that protein levels and function can be regu- 
lated by subtle differences in the sequence of mRNA 
molecules in a cell. 

In bacteria, short DNA sequences are transcribed 
in a one to one fashion to mRNA. This mRNA is re- 
ferred to as a gene or a transcript. Like DNA, each 
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Fig. 1. A gene (DNA sequence) with three exons. During 
transcription, two isoforms are generated. The first isoform 
contains all of the gene's three exons. The second isoform 
contains the first and third exon, skipping the middle exon. 
This process is called alternative splicing and the middle exon 
is called an alternatively spliced exon. 

mRNA is a string of nucleotides, each position tak- 
ing four possible values. Mammalian cells commonly 
generate a large class of mRNA molecules from a sin- 
gle relatively short DNA sequence. The set of such 
mRNA molecules are called isoforms of a gene. This 
paper concentrates on one common mechanism gen- 
erating isoforms called alternative splicing. An ex- 
ample of alternative splicing is depicted in Figure 1 : 
two isoforms can arise from the same gene when the 
DNA, which is comprised of three sequence blocks 
(called exons), can be transcribed into two different 
mRNA molecules: one of which contains all three 
exons and one of which only contains the first and 
third exon. As this example shows, isoforms typical- 
ly have highly similar sequence. Despite this sequen- 
ce similarity, isoforms can encode proteins which 
may have different functional roles. Further, most ge- 
nes have more than three exons, and alternative use 
of exons can give rise to large numbers of isoforms. 
Thus, it has been historically difficult for technol- 
ogy and statistical methods to allow researchers to 
distinguish between different isoforms of the same 
gene. 

1.2 Ultra High Throughput Sequencing 

Ultra High Throughput Sequencing (UHTS or sim- 
ply "sequencing") is an emerging technology which 
promises to become as (or more) powerful, popular 
and cost-effective than current microarray technol- 
ogy for several applications, including isoform esti- 
mation. When used to study mRNA levels, UHTS is 
referred to as RNA-Seq. In the past year, studies us- 
ing UHTS to study genome organization, including 
isoform expression, have been prominent (see Pan 
et al. (2008); Zhang et al. (2009); Wahlstedt et al. 



(2009) ; Hansen et al. (2009); Maher et al. (2009)) 
and featured in the journals Science and Nature 
(see Sultan et al. (2008); Wang et al. (2008)), which 
dubbed 2007 as the "year of sequencing" (see Chi 
(2008)). 

Briefly, UHTS is a method that relies on directly 
sequencing the nucleotides in a sample rather than 
inferring abundance of mRNA by measuring intensi- 
ties using predetermined homologous probes as micro- 
arrays do. Thus, the data generated from an UHTS 
experiment are large numbers of discrete strings of 
nucleotides, called base pairs (bp), which can take 
values of A, C, G or T. In 2010, each experiment 
produced tens of millions of up to lOObp reads. The 
throughput of this technology is expected to con- 
tinue its rapid growth. 

Two experimental protocols for RNA-Seq are in 
common use: (a) single end and (b) paired end se- 
quencing experiments. For single end experiments, 
one end (typically about 50-100 bp) of a long (typi- 
cally 200-400 nucleotide) molecule is sequenced. For 
paired end experiments, typically 50-100 bp of both 
ends of a typically 200-400 nucleotide molecule are 
sequenced. Using current Illumina technology, each 
time the sequencing machine is operated, eight sam- 
ples (e.g., potentially eight different catalogues of 
gene expression) can be interrogated (essentially) 
independently and tens of millions of reads are pro- 
duced in each sample. 

1.3 Related Work 

An important application and use of UHTS tech- 
nology is to quantify the abundance of mRNA in 
a cell (RNA-Seq). This is done by matching the se- 
quences generated in an UHTS experiment to a data- 
base of known mRNA sequences (called alignment) 
and inferring the abundance of each mRNA from 
the number of experimental reads (fragments of the 
original mRNA molecules) aligning to it. Sometimes, 
a statistical model is used for this estimate. Im- 
portantly, experimental steps involved in an UHTS 
experiment can affect the probability of each frag- 
ment being observed, although modeling of these 
processes is not the focus of this paper. 

The rapid technological advances in sequencing 
have spawned a large number of algorithms for an- 
alyzing sequence data (see Langmead et al. (2009); 
Trapnell, Pachter and Salzberg (2009); Trapnell et al. 

(2010) ; Mortazavi et al. (2008)), some of which aim 
to estimate mRNA abundance. To date, inference on 
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the abundance of niRNA has been made by ahgn- 
ing reads to known genes and estimating a gene's 
expression by averaging the number of reads which 
map uniquely to it using the simphfying assump- 
tion that the transcript is sampled uniformly (see 
Jiang and Wong (2009); Mortazavi et al. (2008)), 
and sometimes using heuristic approaches to accom- 
modate reads which map to multiple locations (see 
Mortazavi et al. (2008)). These models do not pro- 
vide optimal estimators of isoform-specific expres- 
sion levels and do not accommodate modeling of 
important steps in the experimental procedure. The 
work in this paper significantly extends a basic Pois- 
son model developed in Jiang and Wong (2009) to 
allow for more flexible and efficient inference and es- 
tablish rigorous statistical theory. In particular, the 
model in Jiang and Wong (2009) does not work with 
paired end sequencing data, or read-specific sample 
rate in a sequencing protocol. 

This paper introduces a statistical model for es- 
timating isoform abundance from RNA-Seq data. 
By grouping the reads into categories and model- 
ing the read counts within each category as Poisson 
variables, the model is flexible enough to accommo- 
date both single end and paired end RNA-Seq data. 
Based on the derivation of minimal sufficient statis- 
tics, a computationally feasible implementation of 
the maximum likelihood estimator of the model is 
provided. Using a study of the Fisher information 
and also numerical simulation, it is shown that us- 
ing paired end RNA-Seq one can get more accu- 
rate isoform abundance estimates. To the best of 
our knowledge, this is the first such statistically rig- 
orous methodology and analysis to be developed. 

2. RNA-SEQ 

Isoforms of a gene are subtle differences in a gene 
sequence, sometimes resulting from inclusion or ex- 
clusion of a single exon, a discrete piece of sequence 
depicted in Figure 1. In principle, compared to mi- 
croarrays, UHTS has the potential to provide high 
resolution estimates of isoform use. However, signal 
deconvolution must take place for these estimates to 
be accurate. 

In order to estimate the expression of different 
isoforms of the same gene, several measurements of 
that gene's expression, whether from a microarray 
or sequencing, must be deconvolved. Several studies 
have investigated this deconvolution problem when 
measurements are made from a microarray (see Hiller 
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Fig. 2. Single end sequencing. A gene of three exons is 
shown with the middle exon of length I being alternatively 
spliced. Reads that come from this gene are shown above the 
gene in solid bars and the parts that are not sequenced are 
shown in broken lines. Reads that span an exon-exon junc- 
tion are shown in solid bars connected by thin lines. Reads 
that are related to the AS exon are shown in red color. In this 
case only the reads in red are isoform informative. 

et al. (2009) or She, Hubbefl and Wang (2009)). This 
paper presents an estimator for deconvolution for ul- 
tra high throughput sequencing experiments. 

As mentioned, two experimental approaches for 
RNA-Seq are in wide use. In single end read experi- 
ments, reads are generated from one end of a molecule 
(depicted schematically in Figure 2); in paired end 
reads, reads are generated from both ends of a mole- 
cule, but typically a large number of nucleotides in- 
terior to the molecule are left unsequenced (depicted 
schematically in Figure 3). The length of the whole 
molecule being sequenced is called the insert size or 
insert length. 

To appreciate the additional information provided 
by the paired end reads, consider Figure 2 which de- 
picts single end reads randomly sampled from a tran- 
script of a gene. Suppose there are two possible iso- 
forms for the transcript of this gene depending on 
whether an exon of length / is retained or skipped. 
In this case, only the reads that come from the al- 
ternatively spliced exon (AS exon), or come from 
junctions involving either the AS exon or the two 
neighboring exons, can provide information to dis- 
tinguish the two isoforms from each other, that is, 
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Fig. 3. Paired end sequencing. A gene of three exons is 
shown with the middle exon of length I being alternatively 
spliced. Paired end reads that come from this gene are shown 
above the gene in solid bars and the parts that are not se- 
quenced are shown in broken lines. Reads that span an ex- 
on-exon junction are shown in solid bars connected by thin 
line. Reads that are directly related to the AS exon are in red 
as before. Reads that provide indirect information for separat- 
ing isoform expressions are m green. 
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only these reads are isoform informative. If the AS 
exon is short compared to the transcript, then the 
majority of the single end reads contain information 
only on gene level expression but not isoform level 
expression. Assuming uniform distribution on the 
reads' positions in the gene, it is evident that a read 
is related to the AS exon with probability P = 
if the read comes from the AS exon inclusive iso- 
form, where L is the length of the whole gene (with- 
out the intronic regions) and r is the length of the 
reads. Thus, P is a strictly increasing function with 
respect to the read length r as well as the AS exon 
length /. As an example, for a gene of length 2000 bp 
with a short AS exon of length 50 bp, P = 0.0406 for 
reads of length 30 bp, P = 0.0513 for reads of length 
50 bp, and P = 0.0789 for reads of length 100 bp. 

Currently, technical limitations limit the length of 
sequenced reads. These limitations vary by particu- 
lar platform used for UHTS. The two platforms in 
widest use are the Illumina platform and the ABI 
SOLID platform. To date, the longest read that can 
be sequenced on the Illumina platform is roughly 
100 bp, and the most reliable read length is still 
roughly 70 bp.^ 

Paired end reads are an attractive way to decou- 
ple the isoform specific gene expression. By perform- 
ing paired end sequencing, reads are produced from 
both ends of the fragments, but the interior of the 
fragment remains unsequenced. This method of se- 
quencing both sides of the fragment increases the 
number of isoform-informative reads as illustrated 
in Figure 3. Paired end reads that are mapped to 
the genes are shown in solid bars above the gene, 
with read pairs connected by broken lines. 

As shown in Figure 3, some read pairs (colored 
red) are directly informative on the retention or skip- 
ping of the AS exon. In addition, some read pairs 
span both sides of the AS exon (colored green). For 
these read pairs, the length of the fragment that 
they span (a.k.a. the insert size or insert length) de- 
pends on whether the AS exon is used or skipped in 
the transcript. If the distribution of the insert size 
is given, then these read pairs can also provide dis- 
criminatory information on the isoforms as shown in 



^The read length is roughly the same for the ABI SOLID 
platform. For the 454 platform the read length can be several 
folds higher, but the throughput is much lower compared to 
the other two platforms. Because sequencing technology is 
developing so rapidly, these numbers are likely to be out of 
date very soon. Our statistical models apply to all platforms 
and all read lengths. 



Figure 3 and developed rigorously through the in- 
sert length model in Section 3.4.2. For illustration, 
suppose the experimental protocol selects fragments 
of sizes around 200 bp for pair-end sequencing.^ In 
such an experiment, if the insert size of a read pair 
is either 200 bp or 350 bp depending on whether 
the read pair came from a transcript that included 
or excluded an exon of length 150 bp, then this read 
pair is unlikely to have come from a transcript that 
retained the AS exon. 

It is easy to see from Figure 3 that the fraction 
of reads that contain information to distinguish the 
two isoforms from each other increases not only with 
the read length and the length of the AS exon, but 
also with the insert size (when the insert size dis- 
tribution is a point mass). Since it is possible to 
have a much longer insert size than read length,^ 
a considerable amount of information can be ex- 
tracted from the paired end reads for decoupling 
the isoform-specific gene expression. This concept 
is developed precisely in the following sections. 

3. THE MODEL 

3.1 Notation 

The notation in Table 1 is used to present the 
statistical model. 

3.2 Assumptions 

The following assumptions on the process of UHTS 
are used in this paper. 

(1) The sample contains I unique transcripts. In 
this paper we deal with one gene at a time and con- 
sider all the isoforms of the genes of interest as the 
set of / transcripts. The abundances for the tran- 
scripts are the parameters of interest and denoted 

(2) After sequencing the sample, there are J dis- 
tinct reads denoted as {sj}j^i. A type of read refers 
to a single end read that is mapped to a specific po- 



^The insert size can be controlled by tuning the parame- 
ters involved in the fragmentation, random priming and size 
selection steps in the sample preparation process. 

^Current technology allows a biochemical modification of 
sequenced molecules (via a circularization step) that can pro- 
duce two short reads from two physical locations on a molecule 
that may be separated by up to several kilobases (using the 
ABI platform or a long-insert protocol from Illumina), which 
is also called the mate-pair sequencing. Although technologi- 
cally it is different from the paired end sequencing, the anal- 
ysis is the same from a statistical point of view. 
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Table 1 
Notation 



Symbol 


Meaning 


T 
1 


Total number or unique transcripts (nucleotide sequences) in the sample. 


J 
J 


Total number of unique reads. 




' 1 " ri o ci r»n n/^ £a n i~iT + vci n or'Ti t^^" ^"irTACi i i — 1 / 
J_ lie dULliiUcLlK^t; Ui Li tlliOl^i ip L LVpfc! t, t — i_ , . . . , J . 


U 


± lie ibUilJiili tiULlilLlcLiiCfc; VfcJCLUi 1 (71 , (72 ; . . . ; "/J . 


Sj 


Read type j, j = 1,... , J. 


Tlij 


The number of reads Sj that are generated from transcripts i. 


rij 


The number of read Sj that are generated from all the transcripts, 




that is, Uj = ^ij ■ 


Q'i,j 


Up to proportionality, the sampling rate of riij, that is, the rate that 




read Sj is generated from each individual transcript i. 


aj 


The sampling rate vector [aij,a2,j, ■ ■ ■ ,aij] for read Sj. 


e-aj 


The sampling rate of nj , that is, the rate that read Sj is generated 




from all the transcripts. 


A 


The I X J matrix of the sampling rates {aij}^!^-^ 


Ci 


The number of copies of the ith transcript in the sample. 


li 


The length of the ith transcript in the sample. 


n 


The total number of reads. 



sition (which can be denoted as the 5' end of the 
read) in a transcript in single end sequencing, or 
a pair of reads that are mapped to two specific posi- 
tions (which can be denoted as the 5' end of the first 
read and the 3' end of the second read) in paired end 
sequencing. 

(3) Each transcript is independently processed 
and then sequenced. 

(4) Hi J, the number of reads of type sj that are 
generated from transcript i, are approximated as 
Poisson random variables with parameter OiUij, whe- 
re ttij is the relative rate that each individual tran- 
script i generates read Sj, called the sampling rate 
defined below. 

(5) Given {9i}i<i<i, {nij}i<i<i^i<j<j are inde- 
pendent random variables. 

If transcript i cannot generate read Sj, aij is set 
to zero: ajj = 0. More specifically, for 1 <ii,i2^ I ■, 
1 ^ 31^32 ^ J 1 assuming none of the CLi^^j^ for A; = 1, 2 
are zero, cn^^j^ are defined so that 

= Pr(read Sj^ observed after 

^12 ,32 

processing one copy of transcript ii) 

/Pr(read Sj^ observed after 

processing one copy of transcript 12)- 

Therefore, up to a multiplicative constant, aij 
is the sampling rate of the jth read from the ith 



transcript. This constant is chosen so that the esti- 
mates of Oi are normalized in order to be comparable 
across experiments. Two such choices are described 
in Section 3.4. With appropriate choice of Ojj, the 
probabilistic interpretation of j can be maintained 
across different experiments.^ 

Example 1 . Suppose a gene has three exons and 
two isoforms, as shown in Figures 2 and 3. Suppose 
the three exons have lengths 200 bp, 100 bp and 
200 bp. Suppose the read length is 50 bp and single 
end reads are generated from a transcript uniformly. 
There are totally 500 different reads. 302 of them 
are from regions shared by the two isoforms, 149 of 
them are from isoform 1 only and 49 of them are 
from isoform 2 only. In this case, 1 = 2, J = 450 and 
the matrix A, up to a multiplicative constant, is 

A 1 ••• 1 1 1 ••• 1 ••• 0\ 
\l 1 ••• 1 ••• 1 1 ••• ly' 

where A has 302 columns of (j^) , 149 columns of (q) 
and 49 columns of 



'''The implementation of the model described in this paper 
ignores reads that align to multiple genes (while of course not 
ignoring reads that align to multiple isoforms). This detail 
does not impact the significant number of genes which con- 
tain no such reads that map to multiple genes, and a simple 
adaptation of the model can accommodate reads mapping to 
multiple genes. 
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3.3 Likelihood Function 

The challenge of estimating isoform abundance 
arises from the fact that different isoforms of a gene 
can have common sequence characteristics and, there- 
fore, different isoforms may generate common read 
types. Thus, the riij's cannot be directly observed. 
Rather, the observed quantities are sequences that 
are necessarily collapsed over the potentially multi- 
ple transcripts generating them. The observed quan- 
tities in an RNA-Seq experiment are therefore rij, 
where 

/ 

i=l 

denoted as rij for simplicity. 

Since {'^ij}i<i<7, i<j<j are assumed to be inde- 
pendent, and it is assumed that the number of reads 
of type Sj that are generated from transcript i fol- 
lows a Poisson distribution with parameter Oittij, 
rij follows a Poisson distribution with parameter 
X]i=i = • aj, where 6 is the vector of isoform 
abundance [9i,92, ■ ■ ■ , Oi] and aj is the vector of sam- 
pling rates [aij,a2j, ■ ■ ■ ,ajj] for read Sj, in which 
there is a component for each isoform. 

Under the assumption that each read is indepen- 
dently generated, given {nj}j^^ are inde- 
pendent Poisson random variables, and therefore ha- 
ve the joint probability density function 

(a „ Yrt, „— 

(2) fB{n,,n,,...,nj) = J[^^^^^ . 

Note that since E(nj) = 9 -Uj = X^^^^ OiO-ij, for all 
i, J, 0, the density (2) is a curved exponential family: 
the natural parameter of the model is in M'^ while 
the underlying parameter is in with J > I. 

3.4 Statistical Models for the Sampling 
Rate: a^j 

This paper focuses on two choices of aij and il- 
lustrates the assumptions and interpretation of the 
resulting {Oi}^^^ parameters. The two choices give 
rise to two different models: the first is the uniform 
sampling model, and the second is the insert length 
model. 

While these models differ by whether insert length 
is taken into consideration, both are motivated by 
the same model of sample preparation below. To fa- 
cilitate such modeling, the biochemical steps prepar- 
ing a sample for sequencing are represented schemat- 
ically as the composition of the following: 



1. Transcript fragmentation: each full length 
mRNA is fragmented at positions according to a Pois- 
son process with rate parameter A.^ 

2. Size selection: each fragment is selected with 
some probability depending on only its length. 

3. Sequence specific amplification or selection: 
each sequence is amplified or further selected based 
on sequence characteristics. 

The sampling rate matrices A for the uniform 
sampling model and the insert length model pre- 
sented below are approximated from the same sta- 
tistical model for steps (a) and (b) above. Namely, 
transcript fragmentation (positions where the tran- 
script is cut) is modeled as a Poisson point process. 
Let p(-) denote the probability mass function of frag- 
ment lengths obtained from this process. Note that 
p{-) is an unobserved quantity because the sample 
is subject to a size selection step after fragmenta- 
tion and before sequencing. The size selection step is 
modeled as follows: a length / fragment of transcript 
is obtained with probability r(Z) independently of 
the identity of the molecule, r(-) is called the filter- 
ing function. 

While the model in steps (a) and (b) are realis- 
tic across experiments, modeling step (c) is more 
involved and variable across experiments. Model- 
ing how the specific nucleotide sequences affect the 
probability of being amplified and selected for se- 
quencing varies significantly by experiment and is 
beyond the scope of this work. However, it is impor- 
tant to emphasize that the model presented in this 
section is flexible enough to account for estimation 
of the effect of step (c). Moreover, the model can be 
adapted to accommodate different model choices in 
any of steps (a), (b) or (c). In the two models pre- 
sented below, it is assumed that sequence selection 
and amplification are uniform. 

Modeling the random processes (a) and (b) above 
as independent and only dependent on a fragment's 
length and assuming that sequence selection and 
amplification are uniform produces a model for the 
distribution of fragment lengths in the sample. This 
distribution is represented by q{-) and can be es- 
timated empirically from a paired end sequencing 
run, namely, mapping both pairs from each read to 



'^Because genomic coordinates are discrete, the occurrence 
times in the Poisson process should be rounded to the nearest 
natural numbers. 
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a database and inferring the insert length. Such an 
empirical function q{-) is depicted in Figure 5 and 
represents a reasonable approximation to the over- 
all distribution of molecule sizes sequenced in an 
experiment. Further, note that a consequence of the 
modeling in steps (a)-(c) above produces the iden- 
tity q{l)=r{l)p{l). 

Some mapping programs (such as introduced in 
Langmead et al. (2009)) have options that take ad- 
vantage of a user specified expected insert size to 
help improve mapping performance, which may lead 
to biases in the mapping. The mapping procedure 
described in this manuscript performs each paired 
end alignment by aligning the first and second read 
separately, which does not bias the insert length mo- 
del and allows for the calculation of minimal suffi- 
cient statistics for the model and to perform statisti- 
cal inference on isoform abundance without such bias. 

3.4.1 Uniform sampling model The uniform sam- 
pling model is appropriate for single read data. It 
assumes that during the sequencing process, each 
read (regarded as a point) is sampled independently 
and uniformly from every possible nucleotide in the 
biological sample. Uniform sampling is a good ap- 
proximation to sampling from a Poisson fragmen- 
tation process and subsequent filtering step when 
the filtering function r(-) has support on a set that 
is small compared to the transcript lengths; under 
these conditions, the process is approximately sta- 
tionary. 

To investigate if the uniform sampling model sat- 
isfactorily approximates the Poisson fragmentation 
and filtering above for numerical regimes of tran- 
script length and fragmentation rate encountered 
in sequencing, the following three simulations were 
performed: reads were generated from 10, 100 or 
1,000 copies of a transcript of length 2,000 bp with 
A = 0.005. All the fragments of length 200 ± 20 bp 
were retained and the fragment ends were then com- 
pared to the sampled read positions as modeled by 
the uniform sampling model (see Figure 4) . It can be 
seen that as the sample size increases, the two mod- 
els are very similar except at the two ends of the 
transcript. At the two ends the Poisson process has 
some boundary effects, and the sequencing protocol 
cannot be explained by a simple model. For most 



'^In the traditional bioinformatics literature this is also 
called alignment, while the nomenclature "mapping" is more 
often used in the UHTS literature where the sequences being 
aligned are short reads. 



situations, these effects will be small, and hence are 
ignored in the uniform sampling model. 

Thus, the uniform sampling model is appropriate 
for sequencing single short reads where the sequenc- 
ing process can be regarded simple random 
sampling process, during which each read (regarded 
as a point) is sampled independently and uniformly 
from every possible nucleotide in the sample. The 
assumption of uniformity implies that a constant 
sampling rate for all aij > is used. Specifically, 
let Cij = if transcript i cannot generate read Sj, 
and otherwise, aij = n, where n is the total number 
of reads. As seen below, n serves as a normalization 
factor. 

To motivate this choice of aij, consider the inter- 
pretation of {9i}l^i induced by A. Under the uni- 
form model, the (unobserved) counts from the jth. 
nucleotide which is generated from the ith transcript 
are modeled as a Poisson random variable with pa- 
ramete aijOi, that is, 

=Po{aij6i). 

Computing E(nj j) using the uniform sampling mo- 
del with n total reads, 

E(raj j) = nPr(jth nucleotide generated by transcript i) 

Ci 

where li is the length of the ith transcript and Cj 
is the number of copies of the ith transcript in the 
sample. Thus, setting Uij = n iff transcript i can 
generate read j produces the identity 



so the uniform sampling model has parameter 

Yli ^i^i 

This choice of A has the property that it normalizes 
{9i}l^i so that 

^^j/j = 1, 

i 

that is, it normalizes 6i as a fraction of the total 
nucleotides sequenced, as shown in Jiang and Wong 
(2009), making it conceptually compatible with the 
RPKM (Reads Per Kilobase of exon model per Mil- 
lion mapped reads) normalization scheme in Mor- 
tazavi et al. (2008), which is widely used by the 
RNA-Seq community. This normalization conven- 
tion assumes the number of nucleotides in the se- 
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Fig. 4. Uniform Q-Q plot with sampled read positions, (a), (b) and (c) are generated by simulations with 10, 100 and 1,000 
copies of transcripts, respectively. 



quenced RNA of each cell does not vary between 
samples. Modifying these assumptions to be more 
realistic yields better choices for normalizing con- 
stants (see, e.g., Bullard et al. (2010)) and can easily 
be incorporated into the normalization of the sam- 
pling rate vector. 

3.4.2 Insert length model This model is applica- 
ble to paired end sequencing data. In paired end 
sequencing, the insert length is usually controlled to 
have a small range. Therefore, as suggested in Fig- 
ure 3, besides read positions, information can also be 
extracted from insert lengths inferred from reads. By 
modeling insert lengths properly, this piece of infor- 
mation can be utilized and statistical inference can 
be improved. Example 2 below illustrates this con- 
cept and Section 6 quantifies the gain in statistical 
efficiency using the pairing information. 



The insert length model models the sampling of 
transcripts, conditional on insert length, as uniform. 
The insert length model sets each a^j- using the em- 
pirical distribution of the insert lengths of the sam- 
ple (see Figure 5) such that conditional on the in- 
sert length, reads are sampled from transcripts uni- 
formly. This is specified mathematically as 

(3) aij=q{lij)n, 

where lij is the length of corresponding fragment of 
Sj on the ith. transcript, n is the total number of 
read counts and q{l) is the probability of a fragment 
of length I in the sample after the filtering. In appli- 
cation, for the insert length model, q{-) is taken as 
q{-), the empirical probability mass function com- 
puted from all the mapped read pairs. A typical 
mass function is illustrated in Figure 5. Although 
usually this function is unimodal (as in this case), 
which favors our isoform estimation approach, our 




approach is flexible enough to allow other types of 
functions, such as bimodal functions, etc. 

To see the relationship between this choice of sam- 
pling rate matrix and a model where reads are sub- 
ject to Poisson fragmentation and length dependent 
filtering, suppose that paired end read sj is mapped 
to transcript 1 at coordinates {xi,yi) and transcript 
2 at coordinates (x2,y2) and both reads are in the 
forward direction. Then, assuming none of xi,X2,yi, 
1/2 is at the boundary of a transcript, under the Pois- 
son fragmentation model (a) and length dependent 
size selection (b), 

Pr(read Sj observed after 

processing one copy of transcript ii) 

/Pr(read Sj observed after 

processing one copy of transcript ^2) 

= Pr(cut at xi,yi, no intermediate cut, 

and transcript of length xi — yi retained) 

/Pr(cut at X2,y2, no intermediate cut, 

and transcript of length X2 — 2/2 retained) 

= Pr(tr. of length xi — yi retained | 

cut at xi,yi, no int. cut) 



• Pr(cut at xi,yi, no int. cut) 

/Pr(tr. of length X2 — y2 retained | 

cut at X2,y2, no int. cut) 

• Pr(cut at X2,y2, no int. cut) 
^ r{\xi -yi\)p{\xi -yi\) 

r{\x2-y2\)p{\x2-y2\) 

^ q{\xi -yi\) 
q{\x2-y2\)' 
Thus, the ratio 

<^ii ,j 

is approximately the same as defined by the sam- 
pling rate matrix A for the insert length model, with 
the assumption that none of xi,X2,yi or 7/2 is on the 
boundary of the transcript. As long as the insert 
length distribution has support which is small com- 
pared to transcript length, relatively few transcripts 
map exactly to the boundary, and little data is lost 
by ignoring them; doing so allows the above con- 
ditions to be satisfied. Further, the argument above 
shows that the insert length model is consistent with 
assumptions (a)-(c) of the sample preparation. 
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The insert length model yields a similar interpre- 
tation for the normalization of as in the uni- 
form sampling model, illustrated in the following 
computation: The paired end read model specifies 
that the reads of type j from transcript i are Pois- 
son with parameter 

Uij = Fo{aij9i). 

The insert length model assumes that reads are fil- 
tered based on length independent of their sequence. 
This produces a method of estimating the expecta- 
tion of riij. The following approximates E(njj) un- 
der the insert length model: 

E('^ij) = nPr(read j observed after 

processing one copy transcript i) 

■.= nFv{AnBnC), 

where A, B and C are defined as follows. Let Y be 
a random variable representing a read in the sample 
after fragmentation. Let A be the event that Y is 
a fragment of transcript i, B the event that Y is read 
j of transcript i and C the event that y is a fragment 
of length /j J- and is observed after filtering. Using the 
product rule, 

Pr(^ r\Br\C)= Pr(5|y4 n C)Pr(C7| A)Pr(A). 

Each term is analyzed separately. Assuming uni- 
form fragmentation across the transcript and length 
dependent filtering, 

Pr(^|^nC)= ^ . 

The basic assumption of the insert length model 
is that the probability of observing a transcript of 
length li^j does not depend on the transcript and is 
equal to the empirical insert length, q{lij), hence 

Pr(C7| A) 

To estimate Pr(A), consider the random variables 
Xi, the number of fragments in the sample from 
transcript i, and X, the total number of transcript 
fragments in the sample. Then, assuming transcript i 
is sufficiently and not overly abundant in the sample, 

'XA . E(X,) 



Pr(A)=E, , , , 
^ ^ \X J E{X) 

Assuming a Poisson fragmentation model, up to 
a boundary effect which has small impact on the 
approximation, 

EjXj) ^ ah 

E(X) 



Combining these approximations yields 

1 C-ili 



C-ili 



Thus, if — is close to 1, is identified in this 
model as 



Thus, in both models, the choice of ajj is consis- 
tent with its definition in equation (1). To illustrate 
the difference between the insert length and uniform 
sampling models, consider the following example: 

Example 2. Consider a case of two isoforms la- 
beled 1 and 2 with an alternative included exon as 
in Figure 1. Suppose the middle exon 2 has length 
50 for concreteness. Suppose pair end read Sj has an 
imputed length of 50 when mapped to 2 and of 100 
when mapped to 1, as will be the case if one of the 
ends is in exon 1 and one in exon 3. Suppose the 
empirical insert length function is modeled as uni- 
form [60, 140). Then, in the uniform model, because 
n total reads have been sequenced and mapped, 



aij — a2j 



whereas in the insert length model, 



aij 



n 
80 



and 



a2j 



0. 



Note that although the denominator 80 in aij 
in the insert length model seems arbitrary, because 
there are 80 different paired end reads that start at 
the same position as Sj, having all of them in the 
model gives consistent gene expression estimates as 
in the uniform model. 

3.5 Maximum Likelihood Estimation 

In this paper is estimated using the MLE. Stan- 
dard theory shows that the MLE of model (2) will 
be consistent provided the parameters in the model 
are in the interior of the parameter space (see The- 
orem 6.3.10 of Lehmann (1998)). Computationally 
efficient procedures are needed to solve for these es- 
timates in practice. 

The fact that the density (2) is Poisson allows 
for a simplification of the calculation of the MLE 
by regarding the parameter estimation as a general- 
ized linear model (GLM) problem with Poisson den- 
sity and identity link function (see McCullagh and 
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Nelder (1989)) with extra linear constraints that re- 
quire ah the parameters {9i}l^i to be nonnegative. 
The optimization problem in matrix form is 

maximize log{A^ 9) — sum{A'^ 6) 

(4) 

s.t. e>o, 

where n is a J x 1 column vector for the observed 
read counts [ni,n2, ■ ■ ■ ,nj], A is & I x J matrix for 
the sampling rates {ai,j}[i/i j=i and 6 is the / x 1 
isoform abundance vector [9i,92, ■ ■ ■ ,0i]. log(-) takes 
logarithm over each element of a vector and sum(-) 
takes summation over all the elements of a vector. 

As shown in Jiang and Wong (2009), the log-like- 
lihood function 

log(£(6l)) = log(/0(ni,n2, . . . ,nj)) 

is always concave and, therefore, any linear con- 
straint convex optimization method can be used to 
solve this nonnegative GLM problem.^ 

4. SUFFICIENCY AND MINIMAL 
SUFFICIENCY 

Because J is usually very large, it is extremely in- 
efficient to work with the statistics {jij}^^ in (2) 
directly: in single end sequencing of a human or 
mouse cell, J can exceed 2,000 for a typical gene, 
and in paired end sequencing with variable insert 
length, it can easily reach 100,000. For computa- 
tional purposes, it is therefore necessary to use suf- 
ficient statistics for the likelihood function (2). Be- 
cause these statistics have an intuitive interpreta- 
tion, they are referred to as a collapsing. This sec- 
tion analyzes sufficiency and minimal sufficiency in 
model (2) and its relation to collapsing. 

4.1 SufFicient Statistics and Collapsing 

As will be shown below, sufficient statistics have 
a natural interpretation as collapsing read counts. 
Proposition 2 shows that to group reads j and k into 
the same category, it is sufficient that reads have the 
same normalized sampling rate vector (i.e., 

dj _ a-k 
\\ak\\' 

where || • || is the vector 2-norm). 



*In our experiments we used the PDCO (Primal- 
Dual interior method for Convex Objectives, http:// 
www.stanford.edu/group/SOL/software/pdco.html) package 
developed by M. A. Saunders at Stanford University. 



Such grouping of reads will be called (maximal) 
collapsings: reads with the same normalized sam- 
pling rate vector are grouped together. Intuitively, 
a maximal collapsing reduces the number of such 
groups to be as small as possible. 

Definition 1 . Let be a collection of reads 

so 

Cfc = {sji,-- • )■5imfc}l<il<j2<•••<im^^<J• 
A set C = {Ck}k^i is called a collapsing, if for any 
Cfc G C and any Sj^,Sj^ £ Cfc, 

for some positive number c. 

Furthermore, if for any ki ^ k-2 and any Sj^ G 

&Ck,. 

for any positive number c, then {Ck}k=i called 
a maximal collapsing. In a collapsing, each Ck is 
called a category. 

As will be seen in Theorem 3, the maximal collaps- 
ing gives rise to a set of minimal sufficient statistics, 
making it useful from a computational perspective. 
A real data example of such a collapsing is provided 
in Section 5. The collapsed read counts also have 
a standard statistical interpretation as the sum of 
independent Poisson random variables. Suppose cat- 
egories 

{Ck I = 1,2, . . . ,K} with Ck ^ {si,S2,...,sj} 

are nonover lapping, that is, Ck^ n = when 
^1 7^ ^2- Then, assuming each Uj follows a Poisson 
distribution with parameter 9 ■ aj, ncf, , the num- 
ber of observed reads that belong to category Ck 
(i.e., nc\ = X^SjGCfc'^i) follows a Poisson distribu- 
tion with parameter 

where o^^^ = "^Y^^i^f^^ ^^"^ 1 < i < ^k, Oj'^^ is 
the sampling rate vector of the jth read in cate- 
gory k. 

Proposition 1. The maximal collapsing is unique. 

Proof. The relation satisfied by two types of 
reads in a category in the maximal collapsing is an 
equivalence relation. This makes the maximal col- 
lapsing a grouping of reads into equivalence classes 
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which are always uniquely determined. To show a re- 
lation is an equivalence relation, it suffices to show 
that the reflexivity, symmetry and transitivity hold. 
Reflexivity: For any Sj, 



aj — aj , 

that is, Sj ~ Sj. 

Symmetry: For any Sj and Sk, 



aj = cak 



1 



Ofc = -aj, 
c 



that is, Sj ~ Sk =^ Sfc ~ Sj. 

Transitivity: For any Sj,Sk and s/, 



aj = ciak 



and 



that is, Sj ~ 



Ofc = C2ai 
Sk and Sk ' 



si 



= ciC2ai, 
Sj ~ si. □ 



To illustrate how maximal collapsing can be de- 
rived from the choice of ajj in the uniform model 
to produce the maximal collapsing, reads with the 
same normalized sampling rate vector are grouped 
into one category. Because ajj is either or n, two 
reads Sj-^ and Sj^ will have the same normalized sam- 
pling rate vector, that is, aji/||aji|| = ajj/Hajj || , if 
and only if they can be generated by the same set 
of transcripts. 

Example 3. Consider a continuation of the setup 
in Example 2. Suppose a uniform sampling model 
and suppose reads si and S2 can be generated by 
both transcripts 1 and 2, whereas read S3 can only 
be generated by transcript 1. Then 



ai = 02 



n, n 



and 



as = [n,0]. 



It is clear that the observed count vector n = 
[fii, n2, . . . , nj] is sufficient for 9. The collapsed read 
count vector is also sufficient for 6, as detailed in the 
next proposition: 

Proposition 2. For any collapsing C = [Ci, 
C2,---,Ck], the observed read count vector uc = 
[nCj , 71(72 ) • ■ ■ ) '^Cif] is a sufficient statistic for 9. 

Proof. From the definition of collapsing, con- 
sider the fcth category Ck with the re-enumerated 

reads {sf^]i<k<K,i<j<mk', the reads in category k 
are enumerated 

— \Si , •'2 ) • • • ) ''mfc J • 

Define a^'^^ to be the sampling rate vector for s^p , 

1 < J < ruk- By definition, for all 1 < j < rrik, for 
(k) 

some scalar Cj > 0, 



Grouping si and S2 together produces the max- 
imal collapsing C = {{si, S2}, {53}}, the ffist cate- 
gory containing reads that can be produced by both 
transcripts and the second category containing reads 
only generated by transcript 1. 

4.1.1 Collapsing and sufficiency Analysis of the 
likelihood function (2) shows that collapsing the reads 
produces sufficient statistics and maximal collaps- 
ings are equivalent to minimal sufficient statistics. 

Recall that a statistic T{X) is sufficient for the pa- 
rameter in a model with likelihood function f0{x) 
if 

fe{X) = h{x)ge{T{X)). 



Ak) _(k)Jk) 



Therefore, 



a (k) (k)a (k) 
9 ■ Oj —Cj 9 ■ a\ . 



Rearranging the product in the right-hand side 
of equation (2) as a product over each read by the 
category into which it falls, and denoting the ith 
read Ui and parameter 9 • ai as x^p with parameter 

9 • a - when it falls as the jth enumerated read in 
the feth category, 

/e(ni,n2,...,nj) 
J 

=n 



(5) 



K rrife 

nn 

fc=ii=i 



(fc) 



■ Oj ) 3 e J 



(k) 



K rrik (Jk)Q ^(k)-, 

nn 

fc=ii=i 

K 



(fe) 



Sk), 



k=l j=l ^j ■ 



= /i(ni, n2, . . . , nj)ge(nc^ , , • • • , ncj^), 
where, since {nj/^i = {xf^]i<j<mk,i<k<K, 

K mk (^^ik)yf^ 



h{ni,n2,...,nj) = JJ 



k=ij=i Xj 
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and 



K 



fc=i 

establishing the sufficiency of nc = [nci , ^^C2 ) • ■ • ) 

In addition to the sufficiency proved in Proposi- 
tion 2, nc is minimal sufficient if the corresponding 
collapsing C is a maximal collapsing. This is detailed 
in the next section. 

4.2 Minimal SufFiciency 

To prove that the read counts derived from a max- 
imal collapsing are minimal sufficient statistics, re- 
call the following: 

Definition 2 (Definition 6.2.13 of (Casella and 
Berger, 2002). For the family of densities fg{-), the 
statistic T{X) is minimal sufficient if and only if 

fejx) 
feiy) 

Theorem 3. In the likelihood specified by equa- 
tion (2), counts on maximally collapsed categories 
are minimal sufficient statistics. 



does not depend on 6* <^ T{X) = T{Y) 



Proof. Let T{X) be the collapsed vector of 
counts , , ■ ■ ■ , let T{Y) be the vector 

of counts ycx ■, 2/C2 ■•■ ■ ■ ■> UCk > each of which are maxi- 
mal collapsings. If T(X) = T{Y), equation (5) shows 
that the ratio of densities 

fejx) 
feiy) 

does not depend on 6. To show the reverse implica- 
tion, suppose T{X)j^T{Y). To show that 

feix) 



fe{y) 

it suffices to show that 
9e{x) 



must depend on 9, 



Ay) 



must depend on 9. 



(6) 



It is possible to simplify this ratio as 



where {ng}g^G and {nh}heH are positive numbers 
and G and H are subsets of the categories and are 
disjoint since if G and H share a common j, the ratio 
in equation (6) can be reduced. Further, since the 



collapsings are maximal, for any ai,aj appearing in 
any product in the numerator or denominator, there 
is no c so that a, = caj. Using these properties, it will 
be shown that the ratio of densities must depend 
on 9 by contradiction. 

Suppose for some (now fixed) T{X) 7^ T{Y), equa- 
tion (6) does not depend on 9 and is equal to a con- 
stant c. Note that since 9 can be the vector of all 
I's, if equation (6) does not depend on 9, c > as 
when 9 is the vector of all I's both the numerator 
and denominator of equation (6) are positive. Then 
equation (6) is equivalent to a polynomial equation 

(7) = cll{9-a,)-^-ll{9-ag)-^ 

heH geG 

\/9 € (M+)^. By basic algebraic geometry, any poly- 
nomial in 9 which is identically zero in the space 
(M"'") is identically zero in all of M^. Therefore, the 
last step is to show that the right-hand side of equa- 
tion (7) is not actually zero for some 9 ^M.^ . To pro- 
ceed, fix /i S i?. The claim is that there exists u G 
with {v, ah) = but \/g G G, 

This V will be the choice of 9 producing the con- 
tradiction. For a vector z G M^, let z'^ denote the 
(/ — l)-dimensional subspace of vectors orthogonal 
to it. Then, to finish the proof, it suffices to show- 
ing that there is some vector in aj^ which is not in 
Ugeo'^g^- equivalent to show there is a strict 
containment 

l\Jaj]nai^=\J{ajnai)c a^ 

\eG ^ geG 

Strict containment follows since for any h ^ H, 
aj^naj 

is a subspace of dimension at most 1 — 2, thus, a count- 
able union of such spaces cannot equal a subspace 
of dimension I — 1. □ 

Using Theorem 3, the optimization problem [equa- 
tion (4)] is reduced to 

maximize n'^ logiA^ 9) — sumiA^ 9) 

(8) 

s.t. 9>0, 

where n is a x 1 column vector for the collapsed 
read counts for categories Ci , C2 , . . . , Ck , ^ is a / x 
K matrix for the collapsed sampling rates and 9 is 
the isoform abundance vector. 

The next section illustrates the relationship of min- 
imal sufficient statistics to raw data observed in se- 
quencing experiments. 
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5. APPLICATION 

This section illustrates how minimal sufficient sta- 
tistics are calculated in an example with real RNA- 
Seq data from an experiment on cultured mouse 
B cells. After the sequencing reads were generated, 
they were mapped to a database of known mouse 
mRNA transcripts using the RefSeq annotation data- 
base (see Pruitt, Tatusova and Maglott (2005)) and 
the mouse reference genome (mm9, NCBI Build 37). 
The reads were mapped using SeqMap, a short se- 
quence mapping tool developed in Jiang and Wong 
(2008). The two ends of the paired end reads were 
mapped separately and then a filtering step was ap- 
plied during which only the pair of reads which were 
mapped to the same transcript and on the right di- 
rection were retained. Further, in the analysis of this 
section, reads that map to multiple genes were also 
discarded for computational ease. Because we are 
mapping the reads to transcript sequences rather 
than the whole genome, the positions that cannot 
be uniquely mapped are less than 1%, which is not 
likely to change our results significantly. Of course, 
the model presented in Section 3 can accommodate 
reads which map to multiple genes because of the 
statistical equivalence of this problem to that of de- 
convolving the expression levels of multiple isoforms. 
We have chosen not to implement this approach be- 
cause only a small number of genes are impacted 
and because as rapid growth of the technology con- 
tinues to produce longer reads, the problem will be- 
come negligible. A total of 2,789,546 read pairs (32 
bp for each end) passed the filtering. The empirical 
distribution of the insert length was inferred. This 
distribution has a mean of 251 bp and a single mode 
of 234 bp (See Figure 5). 

Because more than 99% of the data have an in- 
ferred length between 73 bp and 324 bp, reads out- 
side of this range are not considered in subsequent 
analysis for this example, as it is likely these reads 
come from unannotated isoforms. This resulted in 
27,118 (about 1%) read pairs being excluded and 
the rest 2,762,428 (denoted as n below) read pairs 
were used in the computation. 

The mouse gene Rnpep is used to demonstrate the 
computation of minimal sufficient statistics. Rnpep 
has an alternatively spliced exon which gives rise to 
two different isoforms (see Figure 6). The gene itself 
is an amino peptidase, meaning that it is used to de- 
grade proteins in the cell. After mapping, 116 read 
pairs were assigned to this gene, out of which 113 



read pairs were used in the computation after outlier 
removal. Figure 6 presents the positions where the 
reads are mapped. The gene was picked because it 
has two alternatively spliced isoforms with a struc- 
ture that makes distinguishing reads from each iso- 
form challenging, and because the number of reads 
was small enough to visualize all of them in a simple 
figure. 

5.1 Uniform Sampling Model 

Any paired end read experiment can be treated as 
a single end read experiment by taking each paired 
end read and treating it as two distinct single end 
reads, one from each side of the pair. In this, the 
113 paired end reads become 226 single end reads 
(without pairing information). 

In the uniform sampling model, for either isoform, 
the sampling rate vector for each read sj can take at 
most two values: 2n when the isoform can generate 
read j and when it cannot. Because there are only 
two isoforms, one of which (isoform 2) excludes one 
of the exons of the other (isoform 1), it is evident 
that in the uniform sampling model, there are only 
three categories for the two isoforms. 

The total length of isoform 1 is 2,300. The to- 
tal length of isoform 2 is 2,183. Hence, comput- 
ing aij by summing over the sampling rate vectors 
of the reads in the same category, the three cat- 
egories can be represented by their sampling vec- 
tors: [4,242n,4,242n], [296n,0], [0,62n]. Using mini- 
mal sufficient statistics reduces the data from a vec- 
tor representing counts on the 2,300 possible reads 
Sj from the two isoforms to the 3 minimal sufficient 
statistics which are counts on these categories. 

The three categories representing minimal suffi- 
cient statistics are tabulated in Table 2. Each cat- 
egory refers to a group of reads that is generated 
by a particular set of isoforms. For example, cate- 
gory 1 consists of reads generated by both isoforms 
and category 3 consists of reads generated by iso- 
form 2 only. Using these statistics to solve the opti- 
mization problem (4), the MLE for the two isoforms 



Table 2 

Single end read categories for Rnpep 



Category ID 


Sampling rate vector 


Read count 


1 


[4,242n,4,242n] 


216 


2 


[296n,0] 


10 


3 


[0, 62n] 
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Fig. 6. Visualization of RNA-Seq read pairs mapped to the mouse gene Rnpep in the CisGenome Browser (see Jiang et al., 
2010). From top to bottom: genomic coordinates, gene structure where exons are magnified for better visualization, read pairs 
mapped to the gene. Reads are 32 bp at each end, A read that spans a junction between two exons is represented by a wider 
box. 



is [^1,^2] = [15.47,2.70].^ Bayesian credible intervals 
for these estimates can be obtained by sampling 
from the posterior space of the parameters (as out- 
lined in Jiang and Wong (2009)), the marginal 95% 
credible intervals for 9i and 02 are (7.89, 18.81) and 
(0.25,10.83), respectively. 

5.2 Insert Length Model 

To visualize how the insert length model can be 
used to produce potentially stronger statistical in- 
ference as compared to the uniform sampling model. 



® All the expression estimates in this paper are in units com- 
patible with RPKM (Reads Per Kilobase of exon model per 
Million mapped reads) (see Mortazavi et al. (2008)). 



consider Figure 6. Each paired end read is depicted 
by two boxes with arrows joining pairs of reads. The 
direction of the arrows represent which side of the 
read was sequenced first. For those interested, the 
direction of the arrows in the Rnpep gene itself in- 
dicates the transcriptional direction of the gene in 
genomic coordinates, although this concept can be 
ignored for the purposes here. Note that there is no 
direct evidence that isoform 2 is present in the sam- 
ple, as no read crosses the junction between the two 
exons which are adjacent in isoform 2 but not in iso- 
form 1. There is direct evidence of the presence of 
isoform 1, for example, as depicted in the fifth read 
from the left in the first row which directly crosses 
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a junction between two exons only adjacent in iso- 
form 1. 

Because of the small gap between exons in the 
figure, reads spanning exons will be slightly longer 
than reads not spanning exons. Also, some inserts 
are very short, and absence of the arrow connecting 
two reads indicates that the entire insert has been 
fully sequenced. Note that several of the reads span- 
ning the alternatively spliced exon are exceedingly 
long. This suggests that such reads are actually gen- 
erated from isoform 2 rather than isoform 1. If such 
reads are generated from isoform 2, they would likely 
have a smaller insert length than the inferred insert 
length when generated by isoform 1, which are the 
lengths depicted in the figure. Because the empirical 
insert length distribution has its only mode near 250 
bp, conditional on observing the 6th and 7th reads 
from the top of the figure spanning the alternatively 
spliced exon, the read is more likely to come from 
isoform 2. Thus, there is indirect evidence of the 
presence of isoform 2 in the sample. 

Such indirect evidence is utilized by the insert 
length model; the model produces quantitative esti- 
mates of the relative abundance of the two isoforms. 
As will be seen in the next section, the abundance 
estimates from the insert length model have larger 
Fisher information than the estimates from the uni- 
form sampling model. 

In the insert length model, each of the possible in- 
sert lengths where q{-) has support produce a unique 
read Sj yielding a total of 569,205 possible reads 
from the two isoforms. The maximal collapsing pro- 
duces a total of 138 categories, some of which are 
represented in Table 3. For intuition, all of the reads 
with a fixed insert length where both ends fall in the 
leftmost 7 or rightmost 3 exons of Rnpep will be in 
the same category, as they have the same probability 
of being sampled. 

Using the minimal sufficient statistics, the MLE is 
computed to be [6*1, 6*2] = [16.73,3.43]. The marginal 

Table 3 
Paired read categories for Rnpep 



Category ID 


Sampling rate vector 


Read count 


1 


[1,681.82?!, l,681.82n] 


95 


2 


[294.60n,0] 


10 


3 


[0, 245.80n] 


2 


138 


[0.0057?!, O.OOlSn] 






95% credible intervals for Oi and 62 are (11.22, 21.02) 
and (1.03,9.29), respectively. The computed marginal 
95% credible intervals for 61 and 82 are nonover- 
lapping, whereas in the single end read model, one 
cannot conclude that the expression of isoforms 1 
and 2 differ. Further, the insert length model has 
slightly smaller marginal credible intervals for each 
parameter. 

This example suggests that although the uniform 
sampling model for single end reads has twice the 
sample size compared with the insert length model 
for paired end reads, the insert length model ac- 
tually provides estimates with smaller standard er- 
rors than those generated by the uniform sampling 
model, because the insert length model can utilize 
the extra information from the insert sizes of the 
reads. This difference can be quantified by analyzing 
the Fisher information of each model, the subject of 
Section 6. 

5.3 Practical Implementation Issues 

In general, to apply Theorem 3, one needs to enu- 
merate all the read types before collapsing, as shown 
in the example of mouse gene Rnpep. This might be 
a time consuming step, especially when the number 
of read types is large. In practice, however, under 
some suitable sampling rate models (which include 
both our uniform model and insert length model), 
it is sufficient to enumerate only the read types that 
have at least one read being mapped. This can re- 
duce the computation when the number of mapped 
reads for the gene is small, or, in other words, when 
the gene is lowly expressed. 

To see how this works, rearrange the right-hand 
side of equation (2) as follows: 

fg{ni,n2,...,nj) 

^ {6 ■ ajY^ e-^-''^ 
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nj>0 ^ i=l 

where only the term J2j=i depends on the sam- 
phng rates of read types with read counts nj = 0. 
Therefore, if we can compute this term without 
knowing each particular sampling rate Uij , the enu- 
meration of all the read types is no longer neces- 
sary. Fortunately, it is possible under some suit- 
able sampling rate models, including both our uni- 
form model and insert length model. For example, in 
the uniform model, X^j=i Ojj- = n(/j — r-|- 1), where n 
is the total number of mapped reads, li is the length 
of transcript i and r is the read length. Similarly, 
in the insert length model, S/=i = 
Eq{r)>oMr){k-r + l). 

Using this trick, we can take only the read types 
with at least one read being mapped and collapse 
them to categories Ci , C2 , . . . , Ck ■ Accordingly, the 
optimization problem [equation (4)] is reduced to 

maximize log{A'^ 9) — W'^ 9 

(10) 

s.t. 9>0, 

where n is a i^T x 1 column vector for the collapsed 
read counts for categories Ci , C2 , . . . , Ck , A is a / x 
K matrix for the collapsed sampling rates and 9 is 
the isoform abundance vector. W is a I x 1 vector 
with the ith element VFj=^^^^ ajj computed based 
on the corresponding sampling rate model. 

In a more complex sampling rate model, for exam- 
ple, when depends on the particular nucleotide 
sequence of read Sj , the optimization problem [equa- 
tion (10)] can still be solved. However, all the read 
types (including the read types with rij = 0) will 
have to be enumerated and each sampling rate Uij 
will have to be computed. 

6. INFORMATION THEORETIC ANALYSIS 

Many considerations impact the choice of sequenc- 
ing protocol in an experimental design. One such 
choice is relative cost of sequencing. In this case, 
the experimentalist may be interested in choosing 
the sequencing protocol (paired end or single end) 
that provides the best estimate of isoform abun- 
dance at the least relative cost. This section out- 
lines the statistical argument for why, in typical sit- 
uations, paired end sequencing can produce better 
estimates of transcript abundance compared to sin- 
gle end sequencing at a fixed number of sequenced 



nucleotides (cost). The theoretical analysis aims to 
show that for the same number of sequenced nu- 
cleotides, the Fisher information in the insert length 
model is more than double the Fisher information in 
the single end read model. Since estimates in RNA- 
Seq are maximum likelihood estimators, their vari- 
ance of the estimator converges to the reciprocal of 
the Fisher information. Thus, larger Fisher informa- 
tion produces estimators with improved accuracy. 

6.1 Theoretical Analysis 

Consider the following quite simple example show- 
ing the increase in information as the fraction of 
reads unique to each isoform grows: 

Example 4. Continuing Example 1, suppose 
that isoform 1 and isoform 2 have Poisson rate pa- 
rameters 9i and 92, respectively, where 92 = 1 — 9i 
and probability < a,/3 < 1, respectively, of pro- 
ducing a read unique to the isoform. Let rii be the 
reads unique to 1, n2 the reads unique to 2 and 
ns the reads which cannot be distinguished between 
the isoforms. Assume there are n total reads in the 
sample, and assume there is uniform fragmentation 
which gives rise to three categories: 

ni = Po{na9i), 

n2 = Po(n/302), 

n3 = Po(?i((l-a)0i + (1-/3)^2)). 

Fix a < 13 as known and compute the information 
in this distribution with 9i as the unknown param- 
eter as a function of a using the definition that the 
information is equal to the variance of the derivative 
of the log likelihood with respect to 9i: 

Tia \ (ni n2 n^{a-l3) \ 

V^i ^2 0i(a-^) + ^y' 

where x = 1 — x. Thus, a — /S = /3 — a, and 5 := 
(3 — a and a are re-parameterizations of /3,a. The 
last equation shows that the partial derivatives of 
the information with respect to a and with respect 
to 5 are positive. 

Note that in the example above, no generality is 
lost by assuming (3 > a since 9i and ^2 can be inter- 
changed with no effect on the model. 

To see that for a fixed cost of sequencing (num- 
ber of sequenced nucleotides) the statistical model 
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exon 1 



exon 2 
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500bp 



50bp 



SOObp 



Fig. 7. A model gene for the study of Fisher information 
and accuracy of the single end and insert length models. 

produced by paired end sequencing has more infor- 
mation than single end sequencing, it is necessary 
to show that the information obtained by twice as 
many single end reads in a single end sequencing ex- 
periment is smaller than that obtained by a paired 
end sequencing experiment. Such a comparison nec- 
essarily depends on each gene, its isoforms and their 
relative abundance. The computation of the Fisher 
information for a typical such example is presented 
below, and the computation shows that the example 
easily generalizes to other configurations of isoforms. 

Example 5. Continuing the running example, 
consider reads of length r = 100 bp and paired end 
insert size x = 200 bp in the schematic of three exons 
in Figure 1, where the length of exons 1 and 3 is 
500 bp and exon 2 is e = 50 bp (see Figure 7). 

For a single end read experiment, as is the prob- 
ability that a read includes any part of the included 
exon (i.e., uniquely identifies isoform 2), so for the 
read length of r, 



1 + e 



as 



1,000 + e-r + l 

and Ps is the probability that a read includes any 
part of the spliced junction (i.e., uniquely identifies 
isoform 1) so 



r — 1 



1,000 -r + 1 



For a paired end read experiment, with x the in- 
sert length. Op, the probability that a read uniquely 
identifies the second isoform, is 



e + X — 1 



ar. 



1,000 + e- j; + l 



and /3p, the probability that a read uniquely identi- 
fies the first isoform, is 



/3p 



x-1 



1,000 -x + 1 



For a concrete example, suppose 9i = 202. Assume 
further that there are twice as many single end reads 



(a sample size of 2n) compared to the n reads in 
a paired end run: 

T o , Qfl , (g, - 0s) \ 
/, :=2n -a, + 3/3, + _ ^ ' 

V2 {2/3){as - (3s) + PsJ 

and the information in a paired end run for a fixed 
insert size is 



T /'^ , Q« , («p - Pp) 



Plugging in numbers x = 200, e = 50, and r = 30 
gives 

0.28. 



Ip 1.12 

In other words, in the insert length model, the 
maximum likelihood estimator of 9i has asymptotic 
variance roughly 3 times larger in the single end read 
experiment than in the paired end experiment. 

Of course, this ratio will change if the parameters 
change. For instance, Is/Ip = 0.63 if x = 200, e = 
50 and r = 70; h/Ip = 0.47 if x = 200, e = 100 and 
r = 50. 

The next section gives simulation results for a re- 
lated example. 

6.2 Simulation Study 

Simulations were used to study the following ques- 
tions: (1) the quality of the proposed model at es- 
timating isoform-specific gene expression, especially 
when the insert length is variable, and (2) whether 
abundance estimates from paired end reads are more 
reliable than abundance estimates from single end 
reads. 

To address these questions, reads were simulated 
from a "hard case" where a gene has three exons 
of lengths 500 bp, 50 bp and 500 bp, respectively 
(see Figure 7); the middle exon can be skipped, pro- 
ducing two different isoforms of the gene. Since the 
middle exon is short, this case has been shown to be 
difficult for isoform-specific gene expression estima- 
tion in Jiang and Wong (2009). 

In the simulation, the two isoforms were assumed 
to have equal abundance. Reads were simulated us- 
ing different models and parameters described in de- 
tail below and estimate isoform abundances as de- 
scribed in Section 3.5. The relative error of estima- 
tion was computed based on the empirical relative 
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Fig. 8. Relative error of different read generation models. X axis is the sample size, that is, the number of reads that is 
generated in each simulation experiment. Y axis is the mean relative error based on 200 simulation experiments. The error 
bars give the standard errors of the sample means. In the figures, single 30 bp reads generated with uniform sampling rate (solid 
curves) are compared to (dashed curves) (a) single 100 bp reads, (b) single 30 bp reads generated with Iognormai sampling 
rate, (c) paired end 30 bp reads generated with Gaussian insert size and (d) paired end 30 bp reads generated with uniform 
insert size. When compared with n (e.g., 5,000^ single end reads, n/2 (i.e., 2,500j pairs of paired end reads were used. 



loss: 

where 9 = [^1,^2] = \] is the true isoform abun- 
dance vector, and 6 = [^1,^2] is the estimated iso- 
form abundance vector after normahzation so that 
^1 + ^2 = 1- Each simulation experiment was re- 
peated 200 times to get the sample mean and stan- 
dard error of the relative error. 



6.2.1 Simulating single end reads with uniform 
sampling To explore the quality of estimation in 
the uniform sampling approach, single end reads 
with length 30 bp using the uniform sampling model 
were generated. Five separated experiments were 
performed to investigate the effect of sample size 
on the estimation procedure using sample sizes of 
10, 50, 200, 1,000 and 5,000, respectively. The solid 
curve in Figure 8(a) gives the sample mean and stan- 
dard error of the relative error. It is clear that rela- 
tive error decreases as the sample size increases. 

To examine whether longer reads can provide bet- 
ter estimates, all the simulation experiments were 
repeated with read length 100 bp. Figure 8(a) shows 
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the comparison between read lengths of 30 bp and 
100 bp. As expected, 100 bp reads produce smaller 
error than 30 bp reads. 

6.2.2 Simulating single end reads with nonuniform 
sampling In real UHTS data, the read distribution 
is not uniform. To evaluate how well the RNA-Seq 
methodology performs in this regime, simulations 
were performed where the positions of reads were 
sampled from a log-normal distribution. Specifically, 
up to a scalar multiple, the true sampling rates Oij 
are independently and identically distributed ran- 
dom variables which follow log-normal distribution 
with mean /i = and standard deviation cr = 1. 

Figure 8(b) gives the comparison between reads 
that were sampled from uniform distribution and 
reads that were sampled from log-normal distribu- 
tion The figure shows that nonuniform reads pro- 
duce estimates which appear consistent, albeit with 
larger error than with uniform reads. 

6.2.3 Simulating paired end reads This section in- 
vestigates whether, in simulation, paired end reads 
can provide more information than single end reads. 
When insert lengths do not have a simple distribu- 
tion, closed form expressions for the information are 
difficult to obtain. Simulation studies are thus im- 
portant tools for analyzing such situations. For this 
purpose, paired end reads of length 30 bp with insert 
size following a normal distribution with mean /i = 
200 bp and standard deviation o" = 20 bp were gen- 
erated. For a given insert size, read pairs were gen- 
erated using a uniform sampling model. Figure 8(c) 
shows that the paired end reads produce smaller er- 
rors than single end reads with the same number 
of sequenced nucleotides: to make the comparison 
comparable on the level of total sequenced bases, 
n/2 pairs of paired end reads were used when com- 
pared with n single end reads. 

When the insert size was generated using a uni- 
form distribution, for example, the effective insert 
size is uniform within 200 it 20 bp, similar results 
were produced [see Figure 8(d)]. Comparing Fig- 
ure 8(d) with Figure 8(a) shows that paired end 30 
bp reads produce similarly accurate estimates as 100 
bp single end reads, which means that, on average, 
paired end reads provide more information per nu- 
cleotide being sequenced. 

6.2.4 Simulating with other parameters We also 
performed simulations with other settings of param- 
eters, for instance, with read length 70 bp, with 



true isoform expression vector (0.1, 0.9) or with exon 
lengths (500 bp, 200 bp, 500 bp). The results are 
shown in Figure 9. In all these simulations, the ad- 
vantage of paired end sequencing over single end 
sequencing is obvious for moderate sampling (50 < 
n < 1,000), as in typical cases for sequencing data. 

7. DISCUSSION 

The insert length model presented in this paper is 
a flexible statistical tool. The model has the capacity 
to accommodate oriented reads from Illumina data 
and to model fragment specific biases in the prob- 
ability of each fragment being sequenced. In Sec- 
tion 3.2 the model has been derived when the exper- 
imental step of fragmentation is assumed to be ap- 
proximated by a Poisson point process, and a tran- 
script is assumed to be retained in the sample in 
proportion to the fraction of transcripts of its length 
estimated after sequencing. These assumptions are 
at once simplifying and realistic. As experimental 
protocols improve, it is likely they will better model 
RNA-Seq data. 

At the current time, several improvements may be 
made to the model to increase its accuracy. First, 
the read sampling rate is undoubtedly nonuniform, 
as it depends on biochemical properties of the sam- 
ple and fragmentation process as experimental stud- 
ies have highlighted (see Ingolia et al. (2009); Vega 
et al. (2009), and Quail et al. (2008)). This effect 
becomes more apparent for longer fragments such 
as those used in paired end library preparation. Ex- 
plicit models for the sampling rates are difficult to 
obtain, but doing so is an area of future research. 
Recent research (see Hansen, Brenner and Dudoit 
(2010); Li, Jiang and Wong (2010)) has shown that 
the nonuniformity can be modeled and estimated 
quite well from the data. It may be possible to com- 
bine these models with our approach to improve the 
estimation performance. 

Statistical tests of the reproducibility of the nonuni- 
formity of reads shows a consistent sequence spe- 
cific bias across biological and technical replicates of 
a gene. This effect could be due to bias in RNA frag- 
mentation, bias in other biochemical sample prepa- 
ration steps or boundary effects when a gene of fixed 
length is fragmented. The last cause of bias can be 
modeled using Monte Carlo simulations of a fixed 
length mRNA sequence subject to a Poisson frag- 
mentation process and incorporated into the insert 
length model. 
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Fig. 9. Relative error of single end reads (solid curves) and paired end reads (dashed curves) with different settings of 
parameters: (a) 70 bp reads, true isoform expression vector (0.5,0.5) (b) 30 bp reads, true isoform expression vector (0.1,0.9) 
(c) 30 bp reads, true isoform expression vector (0.5,0.5), exon lengths (500 bp, 200 bp, 500 bp). When compared with n (e.g., 
5,000^ single end reads, n/2 (i.e., 2,500^ pairs of paired end reads were used. 



Similarly, the fragmentation and filtering steps ha- 
ve not been explicitly modeled in the insert length 
model presented here. Rather, the probability mass 
function of read lengths, what is necessary for defin- 
ing the model, has been estimated empirically. Im- 
provements to the model could be made by increas- 
ing the precision of the estimate of the probabil- 
ity mass function of read lengths, for example, by 
simulating a fragmentation and filtering process by 
Monte Carlo and matching the output of the sim- 
ulations to the empirical distribution function q[-). 
If such modeling were desired, as described in Sec- 
tion 3.2, the effects could be easily incorporated into 
the insert length model. On the other hand, as ex- 
perimental protocols improve, they may reduce this 
bias and increase the accuracy of the insert length 
model as presented in this paper. 



In reality, sequencing mapping is another step that 
may affect the analysis. For instance, some reads 
cannot be mapped because of sequencing errors and 
some can be mapped to multiple places. We have 
not focused on the issue of mapping fidelity be- 
cause we restrict attention to the reads which did 
map uniquely. We are also not taking into account 
mapping errors which themselves require statistical 
modeling. We have chosen not to model these errors 
partly because some mapping errors are platform- 
dependent (i.e., different sequencing errors tend to 
be made by the Illumina vs. other platforms). 

In some applications, the parameters of interest to 
biologists are not the RPKMs for isoforms 1 and 2, 
but rather the relative expression ratio of both iso- 
forms. One way to estimate the ratio is to reparam- 
eterize the problem with 6i as a first parameter and 
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a second parameter fi = 6i IO2 ■ The reparameteriza- 
tion win make the model no longer linear in the pa- 
rameters, therefore harder to solve. Also, the choice 
of \i is not straightforward when there are 3 or more 
isoforms. An easier way is to estimate /i indirectly 
after estimating Q\ and 02- 

We believe that technological improvements that 
produce longer read lengths will not diminish the 
relevance of the insert length model. Paired end mod- 
els will be relevant at least until read lengths are 
comparable to the length of each transcript, and 
perhaps longer for reasons of cost. Since many tran- 
scripts are larger than 10^ nucleotides, and longer 
in some important cases, such a time is unlikely to 
occur in the next few years. Further, longer insert 
lengths and reads combined with the insert length 
model in this paper will aid in discrimination of 
complex isoforms and estimation of isoform-specific 
poly-A tail lengths. Thus, we do not foresee any im- 
minent obsolescence of this model. 

While the model developed in this paper has the 
potential for great use and extends current method- 
ology for isoform-specific estimation, the model as- 
sumes that the complete set of isoforms of a gene 
have been annotated. De novo discovery of isoforms 
from a sample is an important and difficult statis- 
tical problem that we have not addressed in this 
paper. Another shortcoming of the model is that in 
order for statistical inference to be accurate, with 
the current short read technology, the number of 
isoforms should be relatively small (e.g., 2-5). We 
expect these challenges to motivate methodological 
development in the field of RNA-Seq in the coming 
years. 

In conclusion, this paper has presented a statis- 
tical model for RNA-Seq experiments which pro- 
vides estimates for isoform specific expression. Find- 
ing such estimates is difficult using microarray tech- 
nology, focusing interest in UHTS to address this 
question. In addition to modeling, the paper has pre- 
sented an in-depth statistical analysis. By using the 
classical statistical concept of minimal sufficiency, 
a computationally feasible solution to isoform esti- 
mation in RNA-Seq is provided. Further, statistical 
analysis quantifies the perceived gain in experimen- 
tal efficiency from using paired end rather than sin- 
gle end read data to provide reliable isoform specific 
gene expression estimates. To the best of our knowl- 
edge, this is the first statistical model for answering 
this question. 
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